docs(coding-agent/edit): document Unicode escape semantics in edit tool prompts #891
Open
apoc wants to merge 1 commit into can1357:main
…ol prompts

Tool-call JSON is parsed before any of the file-writing tools sees its content arguments, so JSON's native `\uXXXX` decoding already covers the "write a Unicode character" case: emitting `"\u2192"` (one backslash) in the JSON delivers `→` to the tool.

To write the *literal* 6-char source sequence `\u2192` (e.g. JS regex `/\u2192/`, Python `r"\u2192"`, JSON fixtures, docs about Unicode), emit `"\\u2192"` (two backslashes) so the JSON parser delivers the 6 chars verbatim.

Add a `<unicode-content>` section to the `write`, `replace`, `patch`, and `hashline` tool prompts spelling out both directions and an explicit "never emit two backslashes when you intend the character" rule. This prevents the common LLM mistake of double-escaping (which produces the literal text on disk) without requiring runtime decoding heuristics that would make literal-escape writing impossible.
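The two directions described in the commit message can be checked with a plain `JSON.parse`. This is a minimal sketch, not code from the PR: the `content` field name and the variable names are illustrative assumptions, and `String.raw` is used only so that the backslashes in the source match the raw JSON text exactly.

```typescript
// Raw JSON text as it would appear in a tool call (illustrative payloads,
// not taken from the repository). String.raw keeps the backslashes as typed.
const oneBackslash = String.raw`{"content": "\u2192"}`;
const twoBackslashes = String.raw`{"content": "\\u2192"}`;

// What a tool would receive once the JSON layer has parsed its arguments.
const asCharacter: string = JSON.parse(oneBackslash).content;
const asLiteral: string = JSON.parse(twoBackslashes).content;

console.log(asCharacter.length); // 1: the single character U+2192
console.log(asLiteral.length);   // 6: backslash, u, 2, 1, 9, 2
```

One backslash in the JSON yields the character; two backslashes yield the six-character escape sequence as source text, with no decoding step needed in the tool itself.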
Owner
This is a bit wasteful in terms of prompt space tbh, although I'm aware of the issue.
Contributor
Author
Agreed. I tried to fix it in tooling, but that had other issues :/
Contributor
Author
Checked the CC code, and it looks like it has no special handling for this kind of edit either.
Background
This replaces #889, which proposed runtime decoding of `\uXXXX` escape sequences in the edit/write pipeline. The Codex review on that PR (inline comment on `normalize.ts:237`) correctly identified that runtime decoding is unsound: it makes it impossible to write a single literal 6-char `\u2192` source-code sequence to disk, because every viable JSON tool argument either decodes to the character or ends up with extra backslashes.

- `"\\u2192"` (natural): the tool receives `\u2192` (6 chars), and runtime decoding writes `→` ❌
- `"\\\\u2192"` (escape attempt): the tool receives `\\u2192` (7 chars), which is written as `\\u2192` ❌

That breaks JS regex source (`/\u2192/`), Python raw strings (`r"\u2192"`), JSON fixtures, and any code that contains `\uXXXX` literally.

Approach
Document the convention in tool prompts instead of decoding at runtime. JSON already decodes `\uXXXX` natively, so:

- Character `→`: emit `"\u2192"` (one backslash) in the JSON, or the literal `→` character. Both arrive at the tool as `→`.
- Literal `\u2192` (regex source, raw strings, fixtures): emit `"\\u2192"` (two backslashes) so the JSON parser delivers the 6 chars verbatim.
- Never emit `"\\u2192"` (two backslashes) when you intend the character. That writes literal text, not a Unicode character.

Add a `<unicode-content>` section to four tool prompts spelling out both directions and the negative example:

- `write.md` (`content`)
- `replace.md` (`old_text` / `new_text`)
- `patch.md` (`diff`: `+` lines and `op:create` payloads)
- `hashline.md` (`content`)

Files
Why this is better than runtime decoding
- `/\u2192/` written to a `.ts` file stays as 6 chars; the tool's job is to write what JSON delivered, not to second-guess it.

Verification

- `bun run format-prompts` clean (no formatting changes needed)
- … `xxd`)
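For readers who want to tell the two outcomes apart at byte level, the sketch below prints the UTF-8 bytes of each string as hex. It is an illustration only, assuming UTF-8 files; `toHex` is a hypothetical helper, not part of the PR or of its verification steps.

```typescript
// Hypothetical helper: render a string's UTF-8 encoding as space-separated hex.
const toHex = (s: string): string =>
  Array.from(new TextEncoder().encode(s), (b) => b.toString(16).padStart(2, "0")).join(" ");

const characterBytes = toHex("\u2192");  // the character U+2192
const literalBytes = toHex("\\u2192");   // the six characters \ u 2 1 9 2

console.log(characterBytes); // e2 86 92
console.log(literalBytes);   // 5c 75 32 31 39 32
```

Three bytes starting with `e2` mean the character was written; six ASCII bytes starting with `5c` mean the literal escape sequence was written.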